Abstract


Community detection is a fundamental problem in network analysis with many
methods available to estimate communities. Most of these methods assume that
the number of communities is known, which is often not the case in practice.
We propose a simple and very fast method for estimating the number of communities
based on the spectral properties of certain graph operators, such as the
non-backtracking matrix and the Bethe Hessian matrix. We show that the method
performs well under several models and a wide range of parameters, and is guaranteed
to be consistent under several asymptotic regimes. We compare the new
method to several existing methods for estimating the number of communities and
show that it is both more accurate and more computationally efficient.


1 Introduction 

The problem of clustering similar objects into groups is a fundamental problem in data analysis. In
network analysis, it is known as community detection. Given a network, which consists
of a set of nodes and a set of edges between them, the goal of community detection is to cluster the
nodes into groups (communities) so that nodes in the same community share a similar connectivity
pattern.

One of the simplest ways of modeling a community structure is the stochastic block model (SBM).
Given the number of communities $K$, the $n$ node labels $c_i$ are drawn independently from a
multinomial distribution with parameter $\pi = (\pi_1, \dots, \pi_K)$. The edges between pairs of
nodes $(i, j)$ are then drawn independently from a Bernoulli distribution with parameter
$P_{c_i c_j}$ and collected in the $n \times n$ adjacency matrix $A$, with $A_{ij} = 1$ if nodes $i$
and $j$ are connected by an edge, and 0 otherwise.
A limitation of the stochastic block model is that all nodes in the same community are equivalent
and follow the same degree distribution, whereas many real networks contain a small number of
high-degree nodes, the so-called hubs. To address this limitation, the degree-corrected stochastic
block model (DCSBM) assigns a degree parameter $\theta_i$ to each node $i$, and edges between nodes
are drawn independently with probabilities $\theta_i \theta_j P_{c_i c_j}$. The community detection
task is to recover the labels $c_i$ given the adjacency matrix $A$.
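
The generative description above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sample_dcsbm`, the 0-based labels, the dense $n \times n$ probability matrix, and the clipping of probabilities to $[0, 1]$ are choices made here, not part of the models themselves.

```python
import numpy as np

def sample_dcsbm(n, pi, P, theta=None, seed=0):
    """Draw labels c and a symmetric 0/1 adjacency matrix A from a (DC)SBM.

    Hypothetical helper for illustration; labels are 0-based.
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    # Labels c_i are drawn independently from a multinomial with parameter pi.
    c = rng.choice(len(pi), size=n, p=pi)
    # theta_i = 1 for every node gives the plain SBM.
    theta = np.ones(n) if theta is None else np.asarray(theta, dtype=float)
    # Edge probability for the pair (i, j): theta_i * theta_j * P[c_i, c_j].
    prob = np.clip(np.outer(theta, theta) * P[np.ix_(c, c)], 0.0, 1.0)
    # Draw each pair i < j once, then mirror it; the diagonal stays zero.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(int)
    return c, A

# Two communities, denser inside than between.
c, A = sample_dcsbm(60, [0.5, 0.5], [[0.5, 0.05], [0.05, 0.5]])
```

Passing a vector `theta` with a few large entries produces hubs, which is the only difference between the two models in this sketch.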

A large number of methods have been proposed for finding the underlying community structure.
Most of these methods require the number of communities $K$ as
input, but in practice $K$ is often unknown. To address this problem, a few likelihood-based methods
have been proposed to estimate $K$ [13, 19, 27, 30, 33], under either the SBM or the DCSBM. These
methods use BIC-type criteria for choosing the number of communities from a set of possible values.
This requires computing the likelihood, using either MCMC or the variational method, both of which
are computationally very challenging for large networks. A different approach, based on the
distribution of leading eigenvalues of an appropriately scaled version of the adjacency matrix, has
also been proposed. Under the SBM, the distributions of the leading eigenvalues converge to the
Tracy-Widom distribution; this fact is used to determine $K$ through a sequence of hypothesis tests.
Since the rate of convergence is slow for relatively sparse networks, a bootstrap correction
procedure was employed, which also leads to a high computational cost. A cross-validation approach
has been proposed as well; it requires estimating communities on many random network splits, and was
shown to be consistent under the SBM and the DCSBM.

To the best of our knowledge, all existing methods are either restricted to a specific model or
computationally intensive. In this paper we propose a fast and reliable method that uses spectral
properties of either the Bethe Hessian or the non-backtracking matrices. Under a simple SBM in the
sparse regime, these matrices have been used to recover the community structure; it was also
observed that the informative eigenvalues (i.e., those corresponding to eigenvectors which encode
the community structure) of these matrices are well separated from the bulk. We will show that the
number of “informative” (to be defined explicitly below) eigenvalues of these matrices directly
estimates the number of communities, and the estimate performs well under different network models
and over a wide range of parameters, outperforming existing methods that are designed specifically
for finding $K$ under either the SBM or the DCSBM. This method is also very computationally
efficient, since all it requires is computing a few leading eigenvalues of just one typically sparse matrix.


2 Preliminaries 

Recall that $A$ is the $n \times n$ symmetric adjacency matrix. Let $d_i = \sum_{j=1}^n A_{ij}$ be
the degree of node $i$.

Treating $A$ as a random matrix, we denote by $\bar A$ the expectation of $A$, and by $\lambda_n$ the
average of expected node degrees, $\lambda_n = \frac{1}{n} \sum_{i=1}^n \mathbb{E} d_i$. For a
symmetric matrix $X$, let $\rho_k(X)$ be the $k$-th largest eigenvalue of $X$. We say that a
property holds with high probability if the probability that it occurs
tends to one as $n \to \infty$. Next, we recall the definitions of the non-backtracking and the Bethe
Hessian matrices which we will use to estimate the number of communities.


2.1 The non-backtracking matrix 


Let $m$ be the number of edges in the undirected network. To construct the non-backtracking matrix
$B$, we represent the edge between node $i$ and node $j$ by two directed edges, one from $i$ to $j$
and the other from $j$ to $i$. The $2m \times 2m$ matrix $B$, indexed by these directed edges, is
defined by

$$B_{i \to j,\, k \to l} = \begin{cases} 1 & \text{if } j = k \text{ and } i \neq l, \\ 0 & \text{otherwise.} \end{cases}$$

It is well known [4, 18] that the spectrum of $B$ consists of $\pm 1$ and the eigenvalues of a
$2n \times 2n$ matrix

$$\tilde B = \begin{pmatrix} 0_n & D - I_n \\ -I_n & A \end{pmatrix}.$$

Here $0_n$ is the $n \times n$ matrix of all zeros, $I_n$ is the $n \times n$ identity matrix, and
$D = \mathrm{diag}(d_i)$ is the $n \times n$ diagonal matrix with degrees $d_i$ on the diagonal. It
was observed in [18] that if a network has $K$ communities then the $K$ largest eigenvalues in
magnitude of $B$ are real-valued and well separated from the bulk, which is contained in a circle of
radius $\|B\|^{1/2}$. We will refer to these $K$ eigenvalues as informative eigenvalues of $B$. It
was also shown in [18] that the spectral norm of the non-backtracking matrix is approximated by


$$\tilde d = \Big( \sum_{i=1}^n d_i \Big)^{-1} \sum_{i=1}^n d_i^2 - 1. \tag{1}$$


For the special case of the SBM, [10] proved that the leading eigenvalues of $B$ concentrate around
the non-zero eigenvalues of $\bar A$ and that the bulk is contained in a circle of radius
$\|B\|^{1/2}$, and used the corresponding leading eigenvectors to recover the community labels.
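
To make the reduction to the $2n \times 2n$ matrix concrete, here is a minimal NumPy sketch. The function names are hypothetical, and the degree-based approximation is taken here to be the ratio of the sum of squared degrees to the sum of degrees, minus one; treat that form as an assumption of the sketch.

```python
import numpy as np

def reduced_nonbacktracking(A):
    """Return the 2n x 2n block matrix [[0, D - I], [-I, A]]."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    I = np.eye(n)
    return np.block([[np.zeros((n, n)), D - I], [-I, A]])

def degree_approximation(A):
    """Degree-based approximation: (sum of d_i^2) / (sum of d_i) - 1."""
    d = np.asarray(A, dtype=float).sum(axis=1)
    return (d ** 2).sum() / d.sum() - 1.0

# Complete graph on 4 nodes: every node has degree 3.
A = np.ones((4, 4)) - np.eye(4)
eigenvalues = np.linalg.eigvals(reduced_nonbacktracking(A))
spectral_radius = np.abs(eigenvalues).max()
approximation = degree_approximation(A)
```

For a $d$-regular graph both quantities equal $d - 1$, so on this example the largest eigenvalue modulus and the degree-based approximation agree (both are 2).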


2.2 The Bethe Hessian matrix 

The Bethe Hessian matrix is defined as

$$H(r) = (r^2 - 1) I - rA + D, \tag{2}$$

where $r \in \mathbb{R}$ is a parameter. In graph theory, the determinant of $H(r)$ is the Ihara-Bass
formula for the graph zeta function. It vanishes if $r$ is an eigenvalue of the non-backtracking matrix.
Saade et al. [29] used the Bethe Hessian for community detection. Under the SBM, they argued that
the best choice of $r$ is $|r_c| = \sqrt{\lambda_n}$, with the sign depending on whether the network
is assortative or disassortative; for a more general network, their choice of $r$ is
$|r_c| = \|B\|^{1/2}$. For assortative sparse networks with $K$ communities and bounded
$\lambda_n$, they showed that the $K$ eigenvalues of $H(r_c)$ whose corresponding
eigenvectors encode the community structure are negative, while the bulk of the eigenvalues of
$H(r_c)$ are positive. Thus, the number of negative eigenvalues of $H(r_c)$ corresponds to the
number of communities. We will refer to these negative eigenvalues of $H(r_c)$ as informative eigenvalues.
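
A small sketch of definition (2) and of counting negative eigenvalues follows. The helper names and the tolerance `tol` are assumptions made for illustration, and a dense eigenvalue solver is used only for simplicity.

```python
import numpy as np

def bethe_hessian(A, r):
    """H(r) = (r^2 - 1) I - r A + D for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    return (r ** 2 - 1.0) * np.eye(A.shape[0]) - r * A + D

def count_negative_eigenvalues(H, tol=1e-10):
    """Number of eigenvalues of the symmetric matrix H below -tol."""
    return int((np.linalg.eigvalsh(H) < -tol).sum())

# Toy network: two disjoint cliques of 5 nodes each (all degrees equal 4).
K5 = np.ones((5, 5)) - np.eye(5)
A = np.block([[K5, np.zeros((5, 5))], [np.zeros((5, 5)), K5]])
H = bethe_hessian(A, r=2.0)  # r = sqrt(average degree)
n_negative = count_negative_eigenvalues(H)
```

In this toy case $H(2) = 7I - 2A$, whose eigenvalues are $-1$ (twice, one per clique) and $9$ (eight times), so the count of negative eigenvalues matches the two groups.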

3 Spectral estimates of the number of communities 

The spectral properties of the non-backtracking and the Bethe Hessian matrices lead to natural
estimates of the number of communities, but they have not been previously considered specifically
for this purpose. We now propose two methods (one for each matrix) to determine the number of
communities $K$.

3.1 Estimating K from the non-backtracking matrix 

Under the SBM, the informative eigenvalues of the non-backtracking matrix are real-valued and
separated from the bulk of radius $\|B\|^{1/2}$ [10]. Therefore we can estimate $K$ by counting the
number of real eigenvalues of $B$ that are at least $\|B\|^{1/2}$. We denote this method by NB (for
non-backtracking). As shown by numerical results in Section 5, this estimate of $K$ also works under
the DCSBM. When the network is balanced (communities have similar sizes and edge densities),
NB performs well; however, the accuracy of NB drops if the communities are unbalanced in either
size or edge density. Computationally, since $B$ is not symmetric, computing the eigenvalues of $B$ is
more demanding for large networks. Thus we focus instead on the Bethe Hessian matrix, which is
symmetric.
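
The NB rule can be sketched as follows. This is an illustration rather than the implementation used in the experiments: it works with the $2n \times 2n$ reduced matrix instead of $B$ itself, uses the largest eigenvalue modulus as a stand-in for $\|B\|$, and treats an eigenvalue as real when its imaginary part is below a hypothetical tolerance `imag_tol`.

```python
import numpy as np

def estimate_k_nb(A, imag_tol=1e-6):
    """Count real eigenvalues of the reduced non-backtracking matrix
    that are at least the square root of its spectral radius."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=1)
    B_tilde = np.block([[np.zeros((n, n)), np.diag(d) - np.eye(n)],
                        [-np.eye(n), A]])
    eig = np.linalg.eigvals(B_tilde)
    # Assumption: the spectral radius plays the role of ||B||.
    threshold = np.sqrt(np.abs(eig).max())
    real_eigs = eig[np.abs(eig.imag) < imag_tol].real
    return int((real_eigs >= threshold).sum())

# One clique gives one group; two disjoint cliques give two.
K5 = np.ones((5, 5)) - np.eye(5)
two_cliques = np.block([[K5, np.zeros((5, 5))], [np.zeros((5, 5)), K5]])
k_one = estimate_k_nb(K5)
k_two = estimate_k_nb(two_cliques)
```

For large sparse networks a dense solver is impractical; an iterative solver for a few leading eigenvalues would be the natural replacement, which this sketch does not attempt.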

3.2 Estimating K from the Bethe Hessian matrix 

The number of communities corresponds to the number of negative eigenvalues of $H(r)$; the
challenge is in choosing an appropriate value of $r$.

It was argued in [29] that when $r = \|B\|^{1/2}$, the informative eigenvalues of $H(r)$ are
negative, while the bulk are positive; by [18], $\|B\|$ can be approximated by $\tilde d$ from (1).
Following these results, we first choose $r$ to be $r_m = \tilde d^{1/2}$ and denote the
corresponding method by BHm. Simulations show that using $r = r_m$ and $r = \|B\|^{1/2}$ produces
similar results; we choose $r = r_m$ because computing $r_m$ is less demanding than computing $\|B\|$.

Another choice of $r$ is $r_a = \sqrt{(d_1 + \cdots + d_n)/n}$, which was proposed in [29] for
recovering the community structure under the SBM; we denote the corresponding method by BHa. We
have found that when the network is balanced, NB, BHm, and BHa perform similarly; when the network is
unbalanced, BHa produces better results.
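
The two choices of $r$ differ only in how the degrees are summarized. The sketch below assumes $r_m$ is the square root of the degree-based approximation of $\|B\|$ (sum of squared degrees over sum of degrees, minus one) and $r_a$ is the square root of the average degree; the function names are hypothetical.

```python
import numpy as np

def r_m(A):
    """Square root of (sum d_i^2) / (sum d_i) - 1, the choice used by BHm."""
    d = np.asarray(A, dtype=float).sum(axis=1)
    return np.sqrt((d ** 2).sum() / d.sum() - 1.0)

def r_a(A):
    """Square root of the average degree, the choice used by BHa."""
    d = np.asarray(A, dtype=float).sum(axis=1)
    return np.sqrt(d.mean())

def negative_eigenvalue_count(A, r):
    """Number of negative eigenvalues of H(r) = (r^2 - 1) I - r A + D."""
    A = np.asarray(A, dtype=float)
    H = (r ** 2 - 1.0) * np.eye(A.shape[0]) - r * A + np.diag(A.sum(axis=1))
    return int((np.linalg.eigvalsh(H) < 0).sum())

# Two disjoint cliques of 5 nodes: both rules find two groups.
K5 = np.ones((5, 5)) - np.eye(5)
A = np.block([[K5, np.zeros((5, 5))], [np.zeros((5, 5)), K5]])
k_bhm = negative_eigenvalue_count(A, r_m(A))
k_bha = negative_eigenvalue_count(A, r_a(A))
```

On a regular graph with degree $d$ the two values are $\sqrt{d - 1}$ and $\sqrt{d}$; they move further apart as the degree distribution becomes more heterogeneous.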

Both BHm and BHa tend to underestimate the number of communities, especially when the network
is unbalanced. In that setting, some informative eigenvalues of $H(r)$ become positive, although
they may still be far from the bulk. Based on this observation, we correct BHm and BHa by also
using positive eigenvalues of $H(r)$ that are much closer to zero than to the bulk. Namely, we sort
the eigenvalues of $H(r)$ in non-increasing order $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_n$, and
estimate $K$ by

$$\hat K = \max\{k : t\rho_{n-k+1} < \rho_{n-k}\}, \tag{3}$$

where $t > 0$ is a tuning parameter. Note that if $\rho_{n-k_0+1} < 0$ then $\hat K \ge k_0$, because
then $t\rho_{n-k_0+1} < \rho_{n-k_0}$ whenever $t > 1$; therefore the number of negative eigenvalues
of $H(r)$ is always upper bounded by $\hat K$. Heuristically, if the bulk follows the semi-circular
law and $\rho_{n-k} > 0$ is given, then the probability that
$0 < \rho_{n-k+1} < \rho_{n-k}/t$ is less than $1/t$. When $1/t$ is sufficiently small, we may
suspect that $\rho_{n-k+1}$ is an informative eigenvalue. In practice we find that $t \in [4, 6]$
works well; we will set $t = 5$ for all computations in this paper. Simulations show that $\hat K$
performs well, especially for unbalanced
networks. The resulting methods are denoted by BHmc and BHac, respectively. We will also use
BH to refer to all the methods that use the Bethe Hessian matrix.
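
The correction in (3) only needs the sorted eigenvalues. A minimal sketch follows; the function name is hypothetical, and returning 0 when no $k$ satisfies the inequality is a convention chosen here, since the text does not address that case.

```python
import numpy as np

def corrected_count(eigenvalues, t=5.0):
    """Largest k with t * rho_{n-k+1} < rho_{n-k}, where
    rho_1 >= ... >= rho_n are the sorted eigenvalues."""
    rho = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    n = len(rho)
    k_hat = 0  # assumption: 0 if the inequality never holds
    for k in range(1, n):
        # 0-based indexing: rho_{n-k+1} is rho[n - k], rho_{n-k} is rho[n - k - 1].
        if t * rho[n - k] < rho[n - k - 1]:
            k_hat = k
    return k_hat

# Two negative eigenvalues, one small positive one, and a bulk near 10.
with_small_positive = corrected_count([-3.0, -2.0, 0.5, 10.0, 11.0, 12.0])
# Same spectrum, but the third eigenvalue now sits inside the bulk.
without_small_positive = corrected_count([-3.0, -2.0, 8.0, 10.0, 11.0, 12.0])
```

In the first spectrum the eigenvalue 0.5 is more than $t = 5$ times smaller than its neighbour 10, so it is counted and the estimate is 3; in the second spectrum only the two negative eigenvalues are counted.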


4 Consistency 

The consistency of the non-backtracking matrix based method (NB) for estimating the number of
communities in the sparse regime under the stochastic block model follows directly from
Theorem 4 in [10]. We state this consistency result here for completeness. The proof given in [10] is
combinatorial in nature, and this approach unfortunately does not extend to any other regimes or to the
Bethe Hessian matrix.

Theorem 4.1 (Consistency in the sparse regime). Consider a stochastic block model with
$\pi = (\pi_1, \dots, \pi_K)$ and $P = (P_{kl}) = \frac{1}{n} P^{(0)}$ for some fixed $K \times K$
symmetric matrix $P^{(0)}$. Assume that $(\mathrm{diag}(\pi) P)^r$ has positive entries for some
positive integer $r$. Further, assume that $\mathbb{E} d_i = \lambda_n > 1$ for all $i$, and all $K$
non-zero eigenvalues of $\bar A$ are greater than $\sqrt{\lambda_n}$. Then with probability tending
to one as $n \to \infty$, the number of real eigenvalues of $B$ that are at least $\|B\|^{1/2}$ is
equal to $K$.

To better understand the condition on the eigenvalues of $\bar A$, consider the simple model
$G(n, \frac{a}{n}, \frac{b}{n})$. This model assumes that there are two communities of equal sizes
and nodes are connected with probability $a/n$ if they are in the same community, and $b/n$
otherwise. Since the two non-zero eigenvalues of $\bar A$ are $(a+b)/2$ and $(a-b)/2$, the condition
on the eigenvalues of $\bar A$ is $(a-b)^2 > 2(a+b)$.
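
The step from the eigenvalue condition to this inequality can be written out as follows, as a short check that assumes $a > b > 0$ and uses $\lambda_n = (a+b)/2$ for this model:

```latex
\frac{a-b}{2} > \sqrt{\lambda_n} = \sqrt{\frac{a+b}{2}}
\;\Longleftrightarrow\;
\frac{(a-b)^2}{4} > \frac{a+b}{2}
\;\Longleftrightarrow\;
(a-b)^2 > 2(a+b).
```

The larger eigenvalue $(a+b)/2$ exceeds $\sqrt{(a+b)/2}$ exactly when $\lambda_n > 1$, which is already assumed. As a numerical illustration, $a = 9$, $b = 1$ satisfies the condition ($64 > 20$), while $a = 5$, $b = 3$ does not ($4 < 16$).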

For the Bethe Hessian, no formal results have been previously established that can be applied
directly. We will show that both the BHm and BHa methods produce consistent estimators of
$K = \mathrm{rank}(\bar A)$ in the dense regime, when $\lambda_n$ grows linearly in $n$, under the
inhomogeneous Erdős-Rényi model with edge probability matrix $\bar A$ (see [9]), which includes as a
special case the stochastic block model with $K$ communities. The inhomogeneous Erdős-Rényi model
assumes that edges are drawn independently between nodes $i$ and $j$ with probabilities
$\bar A_{ij}$. Let

$$d_0 = \min_i \mathbb{E} d_i, \qquad d = \max_{i,j} n \bar A_{ij}.$$

It is clear that $d_0 \le \lambda_n \le d$. For the simple model $G(n, \frac{a}{n}, \frac{b}{n})$ we
have $d_0 = \lambda_n = (a+b)/2$ and $d = \max(a, b)$.

Theorem 4.2 (Consistency in the dense regime). Consider an inhomogeneous Erdős-Rényi model
with $\mathrm{rank}(\bar A) = K$ such that

$$\rho_K(\bar A) \ge 5d/\sqrt{d_0}, \quad \text{and} \quad d_0 > (1 + \varepsilon)\, d\, (1 - d/n)$$

for some constant $\varepsilon > 0$. Then with high probability, the Bethe Hessian $H(r)$ with
$r = r_m$ or $r = r_a$ has exactly $K$ negative eigenvalues.


Proof. Let us first rewrite the Bethe Hessian as

$$H(r) = (r^2 - 1) I - r(A - \bar A) + D - r\bar A =: \tilde H(r) - r\bar A.$$

We will show that the eigenvalues of $\tilde H(r)$ are non-negative and are of smaller order than
the non-zero eigenvalues of $r\bar A$. This in turn implies that $K$ eigenvalues of $H(r)$ are
negative while the rest are positive.

To bound $A - \bar A$, we use a known concentration result: with high probability,

$$\|A - \bar A\| \le 2\sqrt{d(1 - d/n)} + C_0 d^{1/4} \log n, \tag{4}$$

for some constant $C_0 > 0$. To bound the node degrees, we use the standard Bernstein's inequality:
there exists a constant $C_1 > 0$ such that, with high probability,

$$\|D - \mathbb{E} D\| \le C_1 \sqrt{d \log n}, \qquad |r^2 - \lambda_n| \le C_1 \sqrt{d \log n}. \tag{5}$$

For square matrices $X$, $Y$ we use $X \succeq Y$ to signify that $X - Y$ is positive semidefinite.
Since $\mathbb{E} D \succeq d_0 I$, from (4), (5), and the assumption that
$d_0 > (1 + \varepsilon)\, d\, (1 - d/n)$, we obtain

$$\tilde H(r) \succeq \Big[ d_0 + \lambda_n - 2\sqrt{\lambda_n d (1 - d/n)} + o(d) \Big] I \succeq 0. \tag{6}$$
For a subspace $U \subseteq \mathbb{R}^n$, we denote by $\dim(U)$ the dimension of $U$, and by
$U^\perp$ the orthogonal complement of $U$. Let $\mathrm{col}(\bar A)$ be the column space of
$\bar A$. Using the Courant min-max principle (see, e.g., the cited Corollary III.1.2) and (6), we have

$$\rho_{n-K}(H(r)) = \max_{\dim(U) = n-K} \; \min_{x \in U,\ \|x\| = 1} \langle H(r)x, x \rangle \ge \min_{x \in \mathrm{col}(\bar A)^\perp,\ \|x\| = 1} \langle \tilde H(r)x, x \rangle \ge 0.$$

Therefore the $n - K$ largest eigenvalues of $H(r)$ are non-negative.

It remains to show that the $K$ smallest eigenvalues of $H(r)$ are negative. From (4), (5), and the
triangle inequality, we obtain

$$\|\tilde H(r)\| \le \lambda_n + d + 2d\sqrt{1 - d/n} + o(d) \le 4d. \tag{7}$$

On the other hand, from (5) and the assumption $\rho_K(\bar A) \ge 5d/\sqrt{d_0}$, we have

$$\rho_K(r\bar A) \ge [1 + o(1)]\, \lambda_n^{1/2} \rho_K(\bar A) > 4d. \tag{8}$$

Combining (7), (8), and using the Courant min-max principle again implies that the $K$ smallest
eigenvalues of $H(r)$ are negative, which completes the proof. □
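
The conclusion of Theorem 4.2 can be checked empirically on a small dense example. The sketch below is only an illustration with hypothetical parameter values: a two-block SBM with $n = 300$, within-community probability 0.9 and between-community probability 0.1, for which $\rho_2(\bar A) \approx 120$, $d = 270$, and $d_0 \approx 150$, so both assumptions of the theorem hold ($5d/\sqrt{d_0} \approx 110$ and $d(1 - d/n) = 27$).

```python
import numpy as np

def negative_count_dense_sbm(n=300, p_in=0.9, p_out=0.1, seed=0):
    """Sample a dense two-block SBM (n even) and count the negative
    eigenvalues of the Bethe Hessian with r = sqrt(average degree)."""
    rng = np.random.default_rng(seed)
    c = np.repeat([0, 1], n // 2)                    # two equal communities
    P = np.array([[p_in, p_out], [p_out, p_in]])
    prob = P[np.ix_(c, c)]                           # n x n edge probabilities
    upper = np.triu(rng.random((n, n)) < prob, k=1)  # pairs i < j only
    A = (upper | upper.T).astype(float)
    d = A.sum(axis=1)
    r = np.sqrt(d.mean())
    H = (r ** 2 - 1.0) * np.eye(n) - r * A + np.diag(d)
    return int((np.linalg.eigvalsh(H) < 0).sum())

n_negative = negative_count_dense_sbm()
```

With these values the two informative eigenvalues of $H(r)$ are far below zero and the bulk is far above it, so the count should equal the number of blocks regardless of the seed.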


More work is needed on the case of an “intermediate” rate of growth of $\lambda_n$ not covered by
either of the theorems, which will require fundamentally different approaches. This is a topic for
future work.


5 Numerical results 

Here we empirically compare the accuracy in estimating the number of communities using the
non-backtracking matrix (NB) and all the versions based on the Bethe Hessian matrix (BHm, BHmc,
BHa, and BHac), described in Sections 3.1 and 3.2. We compare them with two other methods
proposed specifically for estimating the number of communities in networks: the network
cross-validation method (NCV) and a likelihood-based BIC-type method (VLH, for
variational likelihood). We use NCVbm and NCVdc to denote the versions of
the NCV method specifically designed for the SBM and the DCSBM, respectively; VLH is only
designed to work under the SBM, so it is not included in the DCSBM comparisons. To make
comparisons with VLH computationally feasible, instead of using the variational method to estimate
the posterior of the community labels as in the original proposal, we estimate the node labels by
the pseudo-likelihood method and then compute the posterior as in the original VLH procedure. In
small-scale simulations where both approaches are computationally feasible (results omitted) we
found that substituting pseudo-likelihood for the variational method has very little effect on the
estimate of $K$. The tuning parameter of VLH is set to one, following its authors. We do not include
the eigenvalue-based hypothesis testing method discussed in Section 1 in
these comparisons due to its high computational cost.

5.1 Synthetic networks 

To generate test case networks, we fix the label vector $c \in \{1, \dots, K\}^n$ so that $c_i = k$
if $n\pi_{k-1} + 1 \le i \le n\pi_k$, where $\pi_0 = 0$. The label matrix
$Z \in \{0, 1\}^{n \times K}$ encodes $c$ by representing each node with
a row of $K$ elements, exactly one of which is equal to 1 and the rest are equal to 0:
$Z_{ik} = \mathbf{1}(c_i = k)$. Let $P$ be a $K \times K$ matrix with diagonal
$w = (w_1, \dots, w_K)$ and off-diagonal entries $\beta$, and
$M = ZPZ^T$. Under the stochastic block model, we generate $A$ according to an edge probability
matrix $\bar A = \mathbb{E} A$ proportional to $M$; the average degree $\lambda_n$ is controlled by
appropriately rescaling $M$. The parameter $w$ controls the relative edge densities within
communities, and $\beta$ controls the out-in probability ratio. Smaller values of $\beta$ and larger
values of $\lambda_n$ make the problem easier.
For the DCSBM, we generate the degree parameters $\theta_i$ from a distribution that takes two values,
$P(\theta_i = 1) = 1 - \gamma$ and $P(\theta_i = 0.2) = \gamma$. The parameter $\gamma$ controls the
fraction of “hubs”, the high-degree nodes in the network, and setting $\gamma = 0$ gives back the
regular SBM. Given $\theta = (\theta_1, \dots, \theta_n)$, the edges are generated independently
with probabilities $\bar A = \mathbb{E} A$ proportional to
$\mathrm{diag}(\theta) M \mathrm{diag}(\theta)$, where $\mathrm{diag}(\theta)$ is a diagonal matrix
with the $\theta_i$ on the diagonal.
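
The construction of the edge probability matrix can be sketched as follows. The helper names are hypothetical, and two details are assumptions of this sketch: nodes are assigned to communities in contiguous blocks of sizes roughly $n\pi_k$, and the rescaling makes the average row sum of the probability matrix equal to the target average degree.

```python
import numpy as np

def expected_adjacency(n, pi, w, beta, avg_degree, theta=None):
    """Return 0-based labels c and an edge probability matrix
    proportional to diag(theta) Z P Z^T diag(theta)."""
    pi = np.asarray(pi, dtype=float)
    K = len(pi)
    # Contiguous blocks of sizes about n * pi_k (last block absorbs rounding).
    sizes = np.round(n * pi).astype(int)
    sizes[-1] = n - sizes[:-1].sum()
    c = np.repeat(np.arange(K), sizes)
    Z = np.eye(K)[c]                       # n x K label matrix
    P = np.full((K, K), float(beta))       # off-diagonal entries beta
    np.fill_diagonal(P, w)                 # diagonal entries w_1, ..., w_K
    M = Z @ P @ Z.T
    if theta is not None:
        theta = np.asarray(theta, dtype=float)
        M = theta[:, None] * M * theta[None, :]
    # Rescale so that the average expected degree equals avg_degree.
    A_bar = M * (avg_degree * n / M.sum())
    return c, A_bar

def sample_theta(n, gamma, seed=0):
    """theta_i = 0.2 with probability gamma, and 1 otherwise."""
    rng = np.random.default_rng(seed)
    return np.where(rng.random(n) < gamma, 0.2, 1.0)

c, A_bar = expected_adjacency(120, [0.5, 0.5], [1, 1], 0.2, 10)
```

An adjacency matrix is then obtained by drawing independent Bernoulli variables with these probabilities for each pair $i < j$ and symmetrizing.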

The number of nodes is set to $n = 1200$, the out-in probability ratio to $\beta = 0.2$, and we vary
the average degree $\lambda_n$, the weights $w$, and the community sizes. We consider three
different values for the number of communities, $K = 2$, 4, and 6. For each setting, we generate 200
replications of the network and record the accuracy, defined as the fraction of the times a method
correctly estimates the true number of communities $K$. The methods NCV and VLH require a
pre-specified set of $K$ values to choose from; we use the set $\{1, 2, \dots, 8\}$ for synthetic
networks and $\{1, 2, \dots, 15\}$ for real-world networks.

We start by varying the average degree $\lambda_n$, which controls the overall difficulty of the
problem, and keeping all community sizes equal. Figure 1 shows the performance of all methods when
all edge density weights are also equal, $w_i = 1$ for all $1 \le i \le K$; in Figure 2,
$w = (1, 2)$ for $K = 2$, $w = (1, 1, 2, 3)$ for $K = 4$, and $w = (1, 1, 1, 1, 2, 3)$ for $K = 6$,
resulting in communities with varying edge density. In all figures, the top row corresponds to the
SBM ($\gamma = 0$) and the bottom row to the DCSBM ($\gamma = 0.9$, which means that 10% of nodes
are hubs).

In general, we see that when everything is balanced (Figure 1), all spectral methods perform fairly
similarly and outperform both cross-validation (NCV) and the BIC-type criterion (VLH). Also, for
larger $K$ and especially under the DCSBM, we can see that the corrected versions are slightly
better than the uncorrected ones, and the best Bethe Hessian based methods are better than the
non-backtracking estimator.

For networks with equal size communities but different edge densities within communities
(Figure 2), cross-validation performs poorly, but VLH relatively improves. For larger $K$ the
spectral methods are also distinguishable, with all BH methods dominating NB, and the corrected
versions providing improvement. Overall, BHac is the best spectral method, comparable to VLH for the
SBM, and best overall for the DCSBM, where VLH is not applicable.


Figure 1 (panels: K = 2, K = 4, K = 6): The accuracy of estimating $K$ as a function of the average
degree. All communities have equal sizes, and $w_i = 1$ for all $1 \le i \le K$.


Communities of different sizes present a challenge for community detection methods in general, and
the presence of relatively small communities makes the problem of estimating $K$ difficult. To test
the sensitivity of all the methods to this factor, we change the proportions of nodes falling into
each community, setting $\pi_1 = r/K$, $\pi_K = (2 - r)/K$, and $\pi_i = 1/K$ for
$2 \le i \le K - 1$, and varying $r$ in the range $[0.2, 1]$. As $r$ increases, the community sizes
become more similar, and are all equal when $r = 1$. Figure 3 shows the performance of all methods
as a function of $r$. The top row corresponds to the SBM ($\gamma = 0$), the bottom row to the DCSBM
($\gamma = 0.9$), and the within-community edge density parameters are $w_i = 1$ for all
$1 \le i \le K$. Here we see that VLH is less sensitive to $r$ than the
spectral methods, but unfortunately it is not available under the DCSBM. Cross-validation is still
dominated by spectral methods except for very small values of $r$, where all methods perform poorly.
The corrections still provide a slight improvement for Bethe Hessian based methods, although all
spectral methods perform fairly similarly in this case.
Figure 2: The accuracy of estimating $K$ as a function of the average degree. All communities have
equal sizes; $w = (1, 2)$ for $K = 2$, $w = (1, 1, 2, 3)$ for $K = 4$, and $w = (1, 1, 1, 1, 2, 3)$
for $K = 6$.


Figure 3 (panels: K = 2, K = 4, K = 6; horizontal axis: community-size ratio): The accuracy of
estimating $K$ as a function of the community-size ratio $r$: $\pi_1 = r/K$, $\pi_K = (2 - r)/K$,
and $\pi_i = 1/K$ for $2 \le i \le K - 1$. In all plots, $w_i = 1$ for $1 \le i \le K$; the
average degrees are $\lambda_n = 10$ (left), 15 (middle), and 20 (right).


5.2 Real world networks 

Finally, we test the proposed methods on several popular network datasets. In the college football
network, nodes represent 115 US college football teams, and edges represent the games played
in 2000. Communities are the 12 conferences that the teams belong to. The political books
network [24], compiled around 2004, consists of 105 books about US politics; an edge means that two
books were “frequently purchased together” on Amazon. Communities are “conservative”, “liberal”, or
“neutral”, labelled manually based on contents. The dolphin network [21] is a social network of 62
dolphins, with edges representing social interactions, and communities based on a split which
happened after one dolphin left the group. Similarly, the karate club network [34] is a social
network of 34 members of a karate club, with edges representing friendships, and communities based
on a split following a dispute. Finally, the political blog network, collected around 2004,
consists of blogs about US politics, with edges representing web links, and communities manually
assigned as “conservative” or “liberal”. For this dataset, as is commonly done in the literature,
we only consider its largest connected component of 1222 nodes.

Table 1 shows the estimated number of communities in these networks. All spectral methods
estimate the correct number of communities for dolphins and the karate club, and do a reasonable job
for the college football and political books data. For political blogs, all methods but NCV and VLH
estimate a much larger number of communities, suggesting the estimates correspond to smaller
sub-communities with more uniform degree distributions that have been previously detected by other
authors. We also found that the VLH method was highly dependent on the tuning parameter, and
the estimates of NCVbm and NCVdc varied noticeably from run to run due to their use of random
partitions.


Dataset           NB  BHm  BHmc  BHa  BHac  NCVbm  NCVdc  VLH  Truth
College football  10   10    10   10    10     14     13    9     12
Political books    3    3     4    4     4      8      2    6      3
Dolphins           2    2     2    2     2      4      3    2      2
Karate club        2    2     2    2     2      3      3    4      2
Political blogs    8    7     8    7     8     10      2    1      2


Table 1: Estimates of the number of communities in real-world networks. 


6 Discussion 

In summary, the numerical experiments suggest that the spectral methods provide extremely fast and
reliable estimates of the number of communities $K$ for balanced networks, with the Bethe Hessian
based method using the choice $r = r_a$ and the correction described in (3) (BHac) being the best
choice in most scenarios. With communities of significantly different sizes, the spectral methods
tend to underestimate $K$ by combining small communities together, which seems to be an intrinsic
limitation of spectral methods. This suggests that their estimates can be used as a lower bound on
$K$ and a starting point for a more elaborate and computationally demanding likelihood-based method
like VLH, in the same way that spectral clustering can be used to initialize a more sophisticated
community detection method. Having a small set of plausible values of $K$ to focus on can
significantly reduce the computational cost and improve the accuracy of estimating the number of
communities.